Introduction and General Background


This document is designed to

Sources of data

Key issues and limitations

Note on PUMAs.


Excel (and .csv) files generated for your use include:



Example of using some of these data: California Population Size Pie Charts



Detailed Background


While the federal Office of Management and Budget (OMB) specifies minimum standards for the collection and reporting of “broad” race and ethnicity data, more detailed data are collected and reported by the US Census Bureau (see Figure above). In addition to the 18 non-mutually exclusive specific racial and ethnic groups listed on the Census data collection form, respondents are given the option to write in their specific identity in a free text field (2020 Census Race and Hispanic Origin Improvements). Population data for select subgroups of these detailed data are available in both the US Census decennial and American Community Survey data tables.


Efforts to explore health outcomes at more granular levels are necessary since significant heterogeneity exists within the larger, aggregated racial and ethnic groupings. For example, mortality rates, socioeconomic status, and COVID-19 and tuberculosis rates all vary widely in California for groups listed within the Asian population (see data brief here). Similar levels of within-group diversity can be found within all of the OMB minimum classification groups. When data are aggregated, information is lost and smaller, often marginalized groups can become hidden, perpetuating longstanding disparities. Conversely, analysis that considers the social and cultural context to explore more specific, disaggregated “sub-groups” can inform public health strategies. Interventions tailored to the unique obstacles and challenges faced by specific communities will be more effective in addressing the root causes of health disparities and result in more equitable outcomes.


Disaggregated race and ethnicity data for specific outcomes, including deaths, cases, and other health indicators, are essential to forming a complete and accurate understanding of health status across California’s diverse populations. Matching these counts of health outcomes with similarly disaggregated population “denominator” data makes it possible to calculate population-based rates, compare them across groups, and identify disparities. This document provides an overview of the available population data that can be used by health programs across CDPH to calculate rates of disease across population subgroups defined by detailed race and ethnicity. Towards this end, recent California legislation has mandated that CDPH disaggregate all Asian and Pacific Islander data in the collection and tabular presentation of data (AB 1726, Gov Code 8310.7 (b)).
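
As a rough illustration of matching numerator counts to population denominators, the sketch below joins event counts to population counts by detailed group and computes crude rates per 100,000. The group labels, the counts, and the column names (detailedRE, deaths, population) are all made-up placeholders for illustration, not CDPH or Census figures.

```r
# Hypothetical numerator ("event") data: made-up counts for illustration only
deaths <- data.frame(
  detailedRE = c("Group A", "Group B", "Group C"),
  deaths     = c(120, 45, 300)
)

# Hypothetical denominator (population) data for the same groups
population <- data.frame(
  detailedRE = c("Group A", "Group B", "Group C"),
  population = c(200000, 50000, 600000)
)

# Join on the shared group label; this only makes sense if both sources
# define the groups in the same way
rates <- merge(deaths, population, by = "detailedRE")

# Crude rate per 100,000 population
rates$rate_per_100k <- rates$deaths / rates$population * 100000

print(rates)
```

The join step is where the alignment of numerator and denominator categories matters most: a group label that is defined differently in the two sources would still join, but the resulting rate would be misleading.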


The interpretation of detailed race and ethnicity population data, detailed numerator or “event” data, and the “alignment” of such denominators and numerators depends heavily on how the data are collected, particularly since race and ethnicity are based on respondents’ self-identification, which may not be consistent over time. Furthermore, the options provided to a respondent at the time of data collection often vary across surveys, so the same person can be classified differently in two datasets. This differential classification between numerator and denominator data sources can bias race- and ethnicity-specific estimates. For example, an individual who identifies as Filipino may check the “Filipino” box on the Census form, but select “Pacific Islander” on a form that does not have detailed Asian or Pacific Islander options.



Appendix


R code to extract data from ACS PUMS

Note: A Census API key is required, which can be obtained at https://api.census.gov/data/key_signup.html.

# Load packages
library(tidycensus)

# Load in census api key
census_api_key("your_census_api_key_here")  # link to get an api key: https://api.census.gov/data/key_signup.html


# Pull raw pums data
detailed_re_pums <- get_pums(
  state = "CA", 
  variables = c("RAC1P", "RAC2P", "HISP", "AGEP"),
  survey = "acs5",
  year = 2019, 
  recode = TRUE,
  rep_weights = "person"
)

# Save data
# Use a forward slash in the path: in an R string, "\r" is read as a carriage return
saveRDS(detailed_re_pums, "data in/rawPUMS_detailedRE_age.RDS")


R code to manipulate data

Note: The standard package for calculating estimates from complex survey objects is the survey package. The srvyr package is an alternative package which wraps some survey functions to allow for analyzing surveys using dplyr-style syntax. tidycensus provides a function, to_survey(), that converts data frames returned by get_pums() into either a survey or srvyr object.

In order to generate reliable standard errors, the Census Bureau provides a set of replicate weights for each observation in the PUMS dataset. These replicate weights are used to simulate multiple samples from the single PUMS sample and can be used to calculate more precise standard errors. PUMS data contain both person- and housing-unit-level replicate weights.

Replicate weights at the person level were included in the PUMS data extraction above by setting the rep_weights argument in get_pums() to "person".
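
To make the role of replicate weights more concrete, the sketch below works through the arithmetic on simulated data. It assumes the ACS PUMS convention of 80 person-level replicate weights (PWGTP1 to PWGTP80) and the replicate-based standard error formula described in Census Bureau PUMS documentation, SE = sqrt(4/80 * sum((replicate estimate - full-sample estimate)^2)). The records and weights here are randomly generated stand-ins, not real PUMS data; in practice the survey object created by to_survey() is expected to handle this calculation.

```r
set.seed(1)

# Simulated stand-in for PUMS person records (not real data):
# in_group flags membership in some population subgroup, PWGTP is the person weight
n <- 50
pums <- data.frame(
  in_group = rbinom(n, 1, 0.3),
  PWGTP    = sample(10:30, n, replace = TRUE)
)

# Simulated replicate weights standing in for PWGTP1-PWGTP80 (one column per replicate)
rep_wts <- sapply(1:80, function(r) pums$PWGTP * runif(n, 0.5, 1.5))

# Full-sample weighted population estimate for the subgroup
est <- sum(pums$PWGTP * pums$in_group)

# The same estimate recomputed once with each replicate weight
rep_est <- colSums(rep_wts * pums$in_group)

# Replicate-based standard error using the 4/80 scaling factor
se <- sqrt(4 / 80 * sum((rep_est - est)^2))

# Approximate 90% confidence interval around the estimate
ci_90 <- est + c(-1, 1) * 1.645 * se
```
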

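# Load packages
library(tidycensus)  # to_survey()
library(dplyr)       # mutate(), case_when(), %>%
library(srvyr)       # survey_count()

# Read raw pums data
detailed_re_pums <- readRDS("data in/rawPUMS_detailedRE_age.RDS")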

# Process data
detailed_re_pums_survey <- to_survey(detailed_re_pums)

detailed_re <- detailed_re_pums_survey %>%
  mutate(ageGroup = cut(AGEP, breaks = c(0, 1, seq(5, 85, by = 10), 199), include.lowest = TRUE, right = FALSE,
                        labels = c("0", "1 - 4", "5 - 14", "15 - 24", "25 - 34", "35 - 44", "45 - 54", "55 - 64", "65 - 74", "75 - 84", "85+")),
         detailedRE = ifelse(HISP_label == "Not Spanish/Hispanic/Latino", as.character(RAC2P_label), as.character(HISP_label)), 
         reGroup = case_when(
           HISP_label != "Not Spanish/Hispanic/Latino" ~ "Latino", 
           RAC1P_label %in% c("Alaska Native alone", 
                              "American Indian alone", 
                              "American Indian and Alaska Native tribes specified; or American Indian or Alaska Native, not specified and no other races") ~ "American Indian/Alaska Native", 
           TRUE ~ as.character(RAC1P_label)
         )
    ) %>%
  survey_count(ageGroup, detailedRE, reGroup, name = "population", vartype = "ci")

# Verify data - Check total population
sum(detailed_re$population)
# Matches total CA population estimate in PUMS (https://www.census.gov/programs-surveys/acs/microdata/documentation.2019.html#list-tab-DO3IVWNQPH4UCXIU03)



# Save processed data
saveRDS(detailed_re, "data in/CA_pop_detailed_RE_age.RDS")

write.csv(detailed_re, "data out/ca_pop_detailed_RE_age_pums.csv", row.names = FALSE)

library(openxlsx)
write.xlsx(detailed_re, "data out/ca_pop_detailed_RE_age_pums.xlsx")


R code to extract data from American Community Survey (ACS)

The code below pulls 2015-2019 ACS 5-year county-level population estimates for Asian alone by selected groups (table B02015), Native Hawaiian and Other Pacific Islander alone by selected groups (table B02016), and Hispanic or Latino origin by specific origin (table B03001). A list of the available detailed race and ethnicity groups and their corresponding ACS variable/table IDs can be found in the detailedRE_ACS_link.xlsx file.

These ACS 5-year data pulled below have two key differences from the ACS PUMS data pulled above:

  1. The ACS 5-year detailed race or ethnicity estimates below are at the county level, while the ACS PUMS estimates above are at the state level. The smallest geographic unit at which these data are available is the census tract for ACS 5-year, and the PUMA for ACS PUMS.
  2. Unlike ACS PUMS, the ACS 5-year detailed Asian and NH/PI estimates below are not stratified by Hispanic or Latino origin.

# census_api_key("your_census_api_key_here")

library(tidycensus)  # get_acs()
library(dplyr)       # left_join(), select(), bind_rows(), %>%
library(readxl)
acsDetailedAsian <- read_xlsx("data in/detailedRE_ACS_link.xlsx", sheet = "Detailed Asian")
acsDetailedNHPI <- read_xlsx("data in/detailedRE_ACS_link.xlsx", sheet = "Detailed NHPI")
acsDetailedLatino <- read_xlsx("data in/detailedRE_ACS_link.xlsx", sheet = "Detailed Latino")


# Pull county-level estimates for every ACS variable listed in one sheet of the link file
ourGet <- function(ourVariable = acsDetailedAsian) {
  get_acs(geography = "county", state = "06", year = 2019, survey = "acs5",
          moe_level = 90, variables = ourVariable$acsID) %>%
    left_join(ourVariable, by = c("variable" = "acsID")) %>%
    select(GEOID, county = NAME, acsID = variable, raceGroup, detailedRace, population = estimate, moe)
}

detailedAsianPop  <- ourGet(acsDetailedAsian)  
detailedNHPIPop   <- ourGet(acsDetailedNHPI)  
detailedLatinoPop <- ourGet(acsDetailedLatino)  


detailed_re_acs <- bind_rows(detailedAsianPop, detailedNHPIPop, detailedLatinoPop)

library(openxlsx)
write.xlsx(detailed_re_acs, "data out/county_pop_detailed_RE_acs.xlsx")
write.csv(detailed_re_acs, "data out/county_pop_detailed_RE_acs.csv", row.names = FALSE)
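
Because the ACS 5-year estimates above are at the county level with 90% margins of error, a common follow-up is to roll them up to larger areas. The sketch below shows one way this might be done, using the standard approximation that the margin of error of a sum is the square root of the sum of squared margins of error. The data frame toy_acs is a hypothetical stand-in that mimics a few columns of detailed_re_acs, and all values are made up for illustration. tidycensus also provides a moe_sum() helper for the same calculation.

```r
# Hypothetical stand-in for detailed_re_acs (made-up values, not ACS estimates)
toy_acs <- data.frame(
  county       = c("County 1", "County 2", "County 1", "County 2"),
  detailedRace = c("Group A", "Group A", "Group B", "Group B"),
  population   = c(1000, 2000, 500, 1500),
  moe          = c(300, 400, 120, 160)
)

# Sum the county estimates within each detailed group
state_pop <- aggregate(population ~ detailedRace, data = toy_acs, FUN = sum)

# Approximate MOE of the sum: square root of the sum of squared county MOEs
state_moe <- aggregate(moe ~ detailedRace, data = toy_acs,
                       FUN = function(m) sqrt(sum(m^2)))

state <- merge(state_pop, state_moe, by = "detailedRace")

# Convert the 90% MOE to a standard error (1.645 is the 90% z-value)
state$se <- state$moe / 1.645

print(state)
```

This approximation ignores any covariance between the county estimates and tends to become less reliable as more small estimates are combined, so the result is best treated as a rough guide.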